Abstract

Background: Methods for linking real-world healthcare data often use a latent class model, where the latent, 
or unknown, class is the true match status of candidate record-pairs. This commonly used model assumes that 
agreement patterns among multiple fields within a latent class are independent. When this assumption is violated, 
various approaches, including the most commonly proposed loglinear models, have been suggested to account for 
conditional dependence. 

Methods: We present a step-by-step guide to identify important dependencies between fields through a 
correlation residual plot and demonstrate how they can be incorporated into loglinear models for record linkage. 
This method is applied to healthcare data from the patient registry for a large county health department. 

Results: Our method could be readily implemented using standard software (with code supplied) to produce an 
overall better model fit as measured by BIC and deviance. Finding the most parsimonious model is known to 
reduce bias in parameter estimates. 

Conclusions: This novel approach identifies and accommodates conditional dependence in the context of record 
linkage. The conditional dependence model is recommended for routine use due to its flexibility for incorporating 
conditional dependence and easy implementation using existing software. 



Background 

Health information exchanges (HIEs), with highly heterogeneous
data, are becoming increasingly important sources of integrated
clinical data supporting many healthcare tasks and health-related
research. HIE data are captured from different independent
databases with different patient identifiers, and best practices
for implementing and operating HIEs are needed. Specifically
with respect to data integration and patient matching, in its
formal recommendations to the Director of the Office of the
National Coordinator for Health Information Technology (HIT) in
2011, the HIT Policy Committee recognized the need to develop
and disseminate best practices for patient matching [1] because
such practices are lacking for matching data in HIEs.

Many methodologies have been proposed to identify 
records in two or more databases that are related to the 
same entity. Deterministic approaches are based on ad-hoc 
rules, which classify a pair of records as matches if the two 
records satisfy certain conditions. Although straightforward
to implement, deterministic approaches are often too con- 
servative with unacceptably high false negative (missed- 
match) rates, especially when data are noisy [2]. This may 
lead to suboptimal care since physicians lack the informa- 
tion necessary to make informed medical decisions. 

Distance-based methods that can handle numerical or
categorical fields, as described in [3], offer another way
to link records. These methods have been shown to perform
similarly to probabilistic methods for both numeric [4] and
categorical data [5] but require one to establish appropriate
distance measures for each variable under consideration. They
are not investigated further here because they are not commonly
used in practice and have not yet been investigated thoroughly
in the HIE setting, although they may be of interest in future
work [6].

Probabilistic methods are another alternative to deterministic
linkage methods. A common probabilistic record linkage method
was proposed by Fellegi and Sunter in 1969 [7]. This model is a
latent class model, where the latent, or unknown, class
represents the true match status of the record pair. For this
model, each field contained in both data sources is compared
within a record pair, and a binary variable is created that
equals 1 if the two values agree and 0 otherwise; thus a binary
vector is created for each record pair. The Fellegi-Sunter (F-S)
model assumes that the agreement patterns of the fields are
independent conditional on the true match status.
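As a minimal sketch of the comparison step, the following builds the binary agreement vector for one record pair. The field names, the exact-equality comparison rule, and the treatment of missing values as disagreements are assumptions made for illustration.

```python
# Illustrative only: field names and the exact-equality rule are assumptions.
FIELDS = ["year_of_birth", "ssn", "zip_code"]


def agreement_vector(rec_a, rec_b, fields=FIELDS):
    """Return a tuple with 1 where both records agree on a field, else 0.

    A missing value (None or empty string) on either side counts as a
    disagreement.
    """
    vector = []
    for field in fields:
        a, b = rec_a.get(field), rec_b.get(field)
        agrees = a not in (None, "") and a == b
        vector.append(1 if agrees else 0)
    return tuple(vector)


pair = (
    {"year_of_birth": "1970", "ssn": "", "zip_code": "46202"},
    {"year_of_birth": "1970", "ssn": "", "zip_code": "46204"},
)
print(agreement_vector(*pair))  # (1, 0, 0)
```

Counting how often each distinct vector occurs over all candidate pairs gives the pattern frequencies that the latent class model is fitted to.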

This conditional independence assumption is often vio- 
lated in real-world record linkage scenarios [8]. When con- 
ditional independence does not hold, estimates of model 
parameters can be substantially biased [9]. This bias can 
lead to inaccurate record linkage outcomes as described 
previously [2,8]. Therefore, finding the most parsimonious 
model that accounts for the conditional dependence will 
provide the most accurate classification of record pairs. 

Various methods have been proposed to address the 
lack of conditional independence in latent class models for 
record linkage. For example, Tromp et al. incorporated 
conditional dependence between two fields by combining 
them into one field with four nominal levels of agreement 
[2]. This strategy can be cumbersome if conditional dependence
exists between more than two fields, since the number of nominal
categories doubles with each additional binary field that is
combined. Schürle proposed
an alternate approach to incorporating conditional depen- 
dence in the traditional F-S model framework by working 
directly with the joint distribution of the observed agree- 
ment pattern given the true match status. However, this 
model involves heavy parameterization that leads to sig- 
nificant overfitting of the model [10]. For example, when 
seven fields are used for record matching, this model in- 
volves 255 parameters, while the data could estimate at 
most 127 parameters. Due to the extreme complexity of 
the model, the proper choice of starting values is critical 
for parameter estimation. This greatly limits the usefulness 
of the approach due to the computational effort required 
to examine multiple starting values. 
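The figures quoted above can be checked with a short count. The count assumes an unrestricted joint distribution over the $2^K$ agreement patterns within each class; this is one reading of the saturated model and is not taken from [10]. For $K = 7$:

```latex
\begin{align*}
\text{free cell probabilities per class} &= 2^{7} - 1 = 127,\\
\text{parameters in the two-class model} &= 2 \times 127 + 1 = 255
  \quad (\text{the } +1 \text{ is the match prevalence}),\\
\text{degrees of freedom in the observed table} &= 2^{7} - 1 = 127.
\end{align*}
```

By comparison, a model that assumes conditional independence needs only $2K + 1 = 15$ parameters for seven fields.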

Latent class models with conditional independence 
can be equivalently formulated using a loglinear framework 
[11,12]. Using this formulation, the conditional independ- 
ence assumption can be readily relaxed to account for con- 
ditional dependence among fields by including interactions 
among fields within the match class or the nonmatch class 
or both [13]. Such loglinear approaches incorporating in- 
teractions within latent classes have been used in many ap- 
plications, notably in diagnostic testing [14,15]. 

Similarly, loglinear models have been applied to record 
linkage applications. Using survey data whose record pairs 
had known match status, Thibaudeau identified fields with 
conditional dependence using a loglinear model with se- 
lected interactions [8]. Winkler estimated a loglinear model 
using three-way interactions, acknowledging that identify- 
ing the correct set of interactions is difficult when a large 
number of fields are involved [16]. Loglinear models with 
certain interaction terms have also been applied in record
linkage by Larsen and his colleagues [17,18]. There has
been no research on effectively identifying appropriate in- 
teractions in record linkage until the stepwise model build- 
ing strategy for identifying interactions recently proposed 
by Zhu et al. [19]. However, this approach can only identify 
models with all interactions of the same order. 

Many previous record linkage studies focused largely 
on maximum likelihood (ML) estimation, where the par- 
ameter estimates of the loglinear model were obtained 
using an Expectation-Maximization (EM) algorithm. For 
situations such as the latent class model where incom- 
plete data (unobserved classes) are involved, the EM al- 
gorithm is a powerful tool to estimate model parameters 
[20]. However, as noted by Winkler (1995), the EM algo- 
rithm takes substantially longer to reach convergence when 
conditional dependence is incorporated in the loglinear 
model because the M-step does not have a closed-form solution
[21]. Alternatively, estimating the loglinear latent class model
can be conveniently implemented using routines in existing
software, such as SAS PROC NLMIXED (SAS Institute Inc., Cary,
NC), thus providing a pragmatic approach to incorporating
conditional dependence more efficiently.

Even though loglinear models have been proposed by 
multiple authors for handling conditional dependence in 
HIE, implementation of such models requires custom- 
ized programs and the process for choosing pairwise in- 
teractions in these models has not been specified. We 
therefore describe and evaluate a method for identifying 
conditional dependence among fields, which are subse- 
quently incorporated as interactions in a loglinear model 
fitted using standard software. To illustrate the method- 
ology, we use an application linking a client list of a 
county health department to itself for de-duplication. The 
step-by-step method described is supplemented by sample 
code which can be readily modified for linking any two 
data sets using standard statistical software. 

Methods 

We first describe a loglinear formulation of the extended 
F-S model with conditional dependence. Let M be the 
true match status of a pair of records (M = 1 for true 
match and M = 0 for true non-match). For each record 
pair with K fields, an agreement vector is observed 

$$Y = (y_1, y_2, \ldots, y_K),$$

where $y_k = 1$ if the $k$th field agrees and 0 otherwise. The
match prevalence is defined as the proportion of agreement
vectors belonging to the true match class and is
$\pi = P(M = 1)$. The parameters of the classical F-S model
include

$$\theta = \{m_1, m_2, \ldots, m_K, u_1, u_2, \ldots, u_K, \pi\},$$

where the m-probabilities are the probabilities of field
agreement given that the record pair is a true match, and the
u-probabilities are the probabilities of field agreement given
that the record pair is a true non-match.

To more effectively accommodate conditional dependence, the
traditional F-S model can be reparameterized using a loglinear
formulation, where the mean number of record pairs with
agreement pattern $Y$ and match status $M$ is given as follows:

$$\log m(Y, M) = \lambda + \lambda_M M + \sum_{k=1}^{K} \lambda_k y_k + \sum_{k=1}^{K} \lambda_{Mk} M y_k. \tag{1}$$

With $K$ fields, there are $D = 2^K$ possible different
agreement patterns. Let $f_d$ represent the frequency count
for the agreement pattern $Y_d$ ($d = 1, 2, \ldots, D$). Then the
log-likelihood is given by

$$l(\theta \mid Y) = \sum_{d=1}^{D} f_d \log(P(Y_d)),$$

where the marginal probability of observing the agreement
pattern $Y_d$ is

$$P(Y_d) = \frac{m(Y_d, M_d = 1) + m(Y_d, M_d = 0)}{\sum_{d=1}^{D} \{m(Y_d, M_d = 1) + m(Y_d, M_d = 0)\}}. \tag{2}$$
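To make Equations (1) and (2) and the log-likelihood concrete, the sketch below evaluates them for a toy problem with two fields. The coefficient values and counts are hypothetical, and the function names are illustrative rather than part of the supplied SAS code.

```python
import math
from itertools import product


def expected_count(y, m_status, lam0, lam_m, lam, lam_mk):
    """Equation (1): mean count for agreement pattern y and match status."""
    log_m = lam0 + lam_m * m_status
    for k, y_k in enumerate(y):
        log_m += lam[k] * y_k + lam_mk[k] * m_status * y_k
    return math.exp(log_m)


def pattern_probabilities(K, lam0, lam_m, lam, lam_mk):
    """Equation (2): marginal probability of every agreement pattern."""
    patterns = list(product((0, 1), repeat=K))
    totals = {
        y: expected_count(y, 1, lam0, lam_m, lam, lam_mk)
        + expected_count(y, 0, lam0, lam_m, lam, lam_mk)
        for y in patterns
    }
    grand_total = sum(totals.values())
    return {y: t / grand_total for y, t in totals.items()}


def log_likelihood(counts, probs):
    """Sum of f_d * log P(Y_d) over the observed patterns."""
    return sum(f * math.log(probs[y]) for y, f in counts.items() if f > 0)


# Hypothetical coefficients for K = 2 fields and hypothetical pattern counts.
probs = pattern_probabilities(2, 0.0, -3.0, [-2.0, -1.0], [4.0, 3.0])
counts = {(0, 0): 80, (0, 1): 12, (1, 0): 5, (1, 1): 3}
print(round(sum(probs.values()), 6))  # 1.0
print(log_likelihood(counts, probs))
```

Because Equation (2) normalizes by the grand total, the intercept cancels and does not affect the pattern probabilities.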



The match score for a specific agreement pattern $Y_d$ is
defined as

$$\log_2 \left\{ \frac{P(Y_d \mid M_d = 1)}{P(Y_d \mid M_d = 0)} \right\}.$$
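Under conditional independence the class-conditional probabilities factor across fields, so the match score reduces to a sum of per-field log-ratios. The sketch below shows that special case only; the m- and u-probabilities are hypothetical values chosen for illustration.

```python
import math


def match_score(y, m_probs, u_probs):
    """log2 of P(y | match) / P(y | non-match) under conditional independence."""
    score = 0.0
    for y_k, m_k, u_k in zip(y, m_probs, u_probs):
        if y_k == 1:
            score += math.log2(m_k / u_k)
        else:
            score += math.log2((1 - m_k) / (1 - u_k))
    return score


# Hypothetical m- and u-probabilities for three fields.
m_probs = [0.9, 0.8, 0.7]
u_probs = [0.1, 0.05, 0.5]
print(match_score((1, 1, 1), m_probs, u_probs))  # positive: favors a match
print(match_score((0, 0, 0), m_probs, u_probs))  # negative: favors a non-match
```

When interaction terms are present, the two class-conditional probabilities no longer factor in this way and would instead be obtained by normalizing the expected counts within each class.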



The loglinear formulation has been shown to be 
equivalent to the F-S classical probabilistic formulation 
of the conditional independence latent class model [12] 
through the following relationships: 



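$$\pi = \frac{\sum_{d=1}^{D} m(Y_d, M_d = 1)}{\sum_{d=1}^{D} m(Y_d, M_d = 1) + \sum_{d=1}^{D} m(Y_d, M_d = 0)},$$

$$m_k = \frac{\exp(\lambda_k + \lambda_{Mk})}{1 + \exp(\lambda_k + \lambda_{Mk})}, \qquad u_k = \frac{\exp(\lambda_k)}{1 + \exp(\lambda_k)}.$$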
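The relationships above can be applied in both directions. The following sketch converts between the loglinear coefficients and $(\pi, m_k, u_k)$ for the conditional independence model. The function names are illustrative, and the closed form used for the class totals follows from summing Equation (1) over all patterns, which is a derivation added here rather than one quoted from [12].

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def lambdas_to_probs(lam_m, lam, lam_mk):
    """Map loglinear coefficients to (pi, m, u) under conditional independence."""
    m = [expit(a + b) for a, b in zip(lam, lam_mk)]
    u = [expit(a) for a in lam]
    # Class totals: exp(lam_M * M) * prod_k (1 + exp(lam_k + lam_Mk * M));
    # the intercept cancels in the ratio that defines pi.
    match_total = math.exp(lam_m) * math.prod(
        1 + math.exp(a + b) for a, b in zip(lam, lam_mk)
    )
    nonmatch_total = math.prod(1 + math.exp(a) for a in lam)
    return match_total / (match_total + nonmatch_total), m, u


def probs_to_lambdas(pi, m, u):
    """Inverse mapping: (pi, m, u) to (lam_M, lam_k, lam_Mk)."""
    lam = [logit(u_k) for u_k in u]
    lam_mk = [logit(m_k) - logit(u_k) for m_k, u_k in zip(m, u)]
    lam_m = logit(pi) + sum(
        math.log((1 - m_k) / (1 - u_k)) for m_k, u_k in zip(m, u)
    )
    return lam_m, lam, lam_mk


# Hypothetical values: round-trip from probabilities to coefficients and back.
coeffs = probs_to_lambdas(0.04, [0.6, 0.7], [0.03, 0.08])
print(lambdas_to_probs(*coeffs))
```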



To incorporate conditional dependence in the loglinear 
model setting, we add the appropriate interaction terms to 
the model. For example, if there is dependence between 
fields $j$ and $l$ within each latent class, the model then
includes two additional terms:

$$\log m(Y, M) = \lambda + \lambda_M M + \sum_{k=1}^{K} \lambda_k y_k + \sum_{k=1}^{K} \lambda_{Mk} M y_k + \lambda_{jl} y_j y_l + \lambda_{Mjl} M y_j y_l. \tag{3}$$

The above loglinear model with interaction terms is easy
to fit in standard statistical software such as SAS (example
code is provided in Additional file 1). The goodness of fit
of a model is measured by both the deviance $G^2$ and the
Bayesian Information Criterion (BIC). We use deviance to
compare nested models. A model with lower deviance 
provides a better fit to the data and hence will be pre- 
ferred. For models that are not nested within each other, 
BIC is the most commonly used criterion for latent class 
modeling as it takes into account the sample size [22]. 
The model with a lower BIC is preferred. 
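The two fit statistics can be computed from the observed and fitted pattern counts. The sketch below uses the usual likelihood-ratio form of the deviance and assumes the convention BIC = -2 log L + p ln N; the text does not state which BIC formula was used, so that choice and the example numbers are assumptions.

```python
import math


def deviance(observed, expected):
    """G^2 = 2 * sum f_d * ln(f_d / fhat_d); empty cells contribute zero."""
    return 2.0 * sum(
        f * math.log(f / expected[y]) for y, f in observed.items() if f > 0
    )


def bic(log_lik, n_params, n_pairs):
    """BIC = -2 log L + p ln N (one common convention, assumed here)."""
    return -2.0 * log_lik + n_params * math.log(n_pairs)


# Hypothetical observed and fitted counts for two fields.
observed = {(0, 0): 80, (0, 1): 12, (1, 0): 5, (1, 1): 3}
fitted = {(0, 0): 78.0, (0, 1): 14.0, (1, 0): 6.0, (1, 1): 2.0}
print(deviance(observed, fitted))
print(bic(log_lik=-100.0, n_params=5, n_pairs=1000))
```

For models fitted to the same data, other common BIC conventions differ from this one only by a constant, so the ranking of candidate models is unchanged.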

In what follows, we describe a series of steps to fit a 
loglinear model with appropriate interactions. Specifically, 
we follow a six-step procedure by identifying the pairwise 
dependencies between fields using the correlation residual 
plot proposed by Qu, Tan, and Kutner [23]. We then in- 
corporate the correlations into the model and re-examine 
the fit of the new model. We iterate between these steps 
as follows: 

Step 1 

Fit a loglinear model with no interactions using the ob- 
served agreement vectors. This is simply the F-S model 
formulated as a loglinear model, which provides initial 
parameter estimates for the next model. Obtain deviance 
and BIC of this conditional independence model. See 
Additional files 1, 2 and 3 for SAS code with example. 
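For readers without SAS, the conditional independence model in Step 1 can also be fitted with the EM algorithm, which has a closed-form M-step in this special case. The sketch below is a substitute technique, not the PROC NLMIXED approach used in this paper; it works in the $(\pi, m_k, u_k)$ parameterization, and the generating values, starting values, and iteration count are hypothetical.

```python
import math
from itertools import product


def pattern_prob(y, probs):
    """Probability of pattern y given per-field agreement probabilities."""
    return math.prod(p if y_k else 1 - p for y_k, p in zip(y, probs))


def em_fellegi_sunter(counts, pi, m, u, n_iter=500):
    """EM for the two-class conditional independence model on pattern counts."""
    n = sum(counts.values())
    K = len(m)
    for _ in range(n_iter):
        # E-step: posterior probability that each pattern comes from a match.
        post = {}
        for y in counts:
            a = pi * pattern_prob(y, m)
            b = (1 - pi) * pattern_prob(y, u)
            post[y] = a / (a + b)
        # M-step: closed-form weighted proportions.
        w_match = sum(f * post[y] for y, f in counts.items())
        w_non = n - w_match
        pi = w_match / n
        m = [sum(f * post[y] * y[k] for y, f in counts.items()) / w_match
             for k in range(K)]
        u = [sum(f * (1 - post[y]) * y[k] for y, f in counts.items()) / w_non
             for k in range(K)]
    return pi, m, u


# Synthetic pattern counts generated exactly from hypothetical parameters.
true_pi, true_m, true_u = 0.05, [0.9, 0.8, 0.85, 0.7], [0.05, 0.1, 0.02, 0.3]
counts = {
    y: 100_000 * (true_pi * pattern_prob(y, true_m)
                  + (1 - true_pi) * pattern_prob(y, true_u))
    for y in product((0, 1), repeat=4)
}
pi_hat, m_hat, u_hat = em_fellegi_sunter(counts, 0.1, [0.8] * 4, [0.2] * 4)
print(round(pi_hat, 3), [round(x, 3) for x in m_hat])
```

The resulting estimates can be converted to loglinear coefficients and used as starting values when interaction terms are added in later steps.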

Step 2 

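Compute the observed pairwise correlation between fields
$j$ and $l$. The correlation between $y_j$ and $y_l$ is

$$\mathrm{Corr}_{jl} = \frac{p_{jl} - p_j p_l}{\sqrt{p_j (1 - p_j)\, p_l (1 - p_l)}}, \tag{4}$$

where $p_j = P(y_j = 1)$, $p_l = P(y_l = 1)$, and
$p_{jl} = P(y_j = 1, y_l = 1)$. Using the observed data, the
estimates for $p_j$, $p_l$, and $p_{jl}$ are given by

$$\hat{p}_j = \frac{\sum_{d=1}^{D} f_d\, y_{dj}}{\sum_{d=1}^{D} f_d}, \qquad \hat{p}_l = \frac{\sum_{d=1}^{D} f_d\, y_{dl}}{\sum_{d=1}^{D} f_d}, \qquad \hat{p}_{jl} = \frac{\sum_{d=1}^{D} f_d\, y_{dj}\, y_{dl}}{\sum_{d=1}^{D} f_d},$$

respectively, where $y_{dj}$ denotes the agreement indicator for
field $j$ in pattern $Y_d$.
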
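Step 2 amounts to a few weighted sums over the pattern frequency table. A minimal sketch, with hypothetical counts for two fields and an illustrative function name:

```python
import math


def pairwise_correlation(counts, j, l):
    """Equation (4) for fields j and l, computed from pattern frequencies."""
    n = sum(counts.values())
    p_j = sum(f for y, f in counts.items() if y[j] == 1) / n
    p_l = sum(f for y, f in counts.items() if y[l] == 1) / n
    p_jl = sum(f for y, f in counts.items() if y[j] == 1 and y[l] == 1) / n
    return (p_jl - p_j * p_l) / math.sqrt(p_j * (1 - p_j) * p_l * (1 - p_l))


# Hypothetical counts: keys are agreement patterns, values are frequencies.
counts = {(0, 0): 50, (0, 1): 10, (1, 0): 10, (1, 1): 30}
print(pairwise_correlation(counts, 0, 1))  # (0.30 - 0.16) / 0.24, about 0.583
```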
Step 3 

Substitute the parameter estimates of the $\lambda$s from the
fitted model in Step 1 into Equation (1) to obtain the expected
number of record pairs $m(Y_d, M_d)$ for each vector pattern
$Y_d$ and match status $M_d$. Calculate the expected marginal
probability $P(Y_d)$ using Equation (2) and the expected cell
count $\hat{f}_d = N P(Y_d)$ for each vector pattern, where
$N = \sum_{d=1}^{D} f_d$ is the number of record pairs. Expected
pairwise correlations are then estimated using (4) (same
formulas as in Step 2) based on the expected counts $\hat{f}_d$
rather than the observed counts $f_d$.
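The sketch below carries out Step 3 for the conditional independence model, written in the equivalent $(\pi, m_k, u_k)$ form rather than with the $\lambda$ coefficients. The parameter values are hypothetical. Note that the expected correlation is generally not zero even under conditional independence, because mixing the two classes induces marginal correlation; this is why residuals are compared against model-based correlations rather than against zero.

```python
import math
from itertools import product


def expected_counts(pi, m, u, n_pairs):
    """Expected count N * P(Y_d) for every pattern under conditional independence."""
    counts = {}
    for y in product((0, 1), repeat=len(m)):
        p_match = math.prod(m_k if y_k else 1 - m_k for y_k, m_k in zip(y, m))
        p_non = math.prod(u_k if y_k else 1 - u_k for y_k, u_k in zip(y, u))
        counts[y] = n_pairs * (pi * p_match + (1 - pi) * p_non)
    return counts


def pairwise_correlation(counts, j, l):
    """Equation (4) applied to a table of (observed or expected) counts."""
    n = sum(counts.values())
    p_j = sum(f for y, f in counts.items() if y[j] == 1) / n
    p_l = sum(f for y, f in counts.items() if y[l] == 1) / n
    p_jl = sum(f for y, f in counts.items() if y[j] == 1 and y[l] == 1) / n
    return (p_jl - p_j * p_l) / math.sqrt(p_j * (1 - p_j) * p_l * (1 - p_l))


# Hypothetical fitted values for three fields.
pi, m, u = 0.04, [0.6, 0.2, 0.4], [0.03, 0.001, 0.04]
fitted = expected_counts(pi, m, u, n_pairs=618_213)
print(pairwise_correlation(fitted, 0, 2))
```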

Step 4 

Compute the correlation residual, which is equal to the
difference between the observed correlation and the expected
correlation for each pair of fields. Plot the residuals across
the different pairs of fields. A correlation residual far from
zero suggests conditional dependence for the corresponding
pair of fields.
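Step 4 can be sketched as a subtraction followed by a screen for large residuals. The correlation values below are hypothetical, and the default cutoff of 0.05 is the approximate guideline reported in the Results; both the function names and the use of a hard cutoff in place of visual inspection of the plot are illustrative choices.

```python
def correlation_residuals(observed, expected):
    """Observed minus expected correlation for every field pair."""
    return {pair: observed[pair] - expected[pair] for pair in observed}


def flag_pairs(residuals, cutoff=0.05):
    """Field pairs whose absolute residual exceeds the cutoff, largest first."""
    flagged = [(pair, r) for pair, r in residuals.items() if abs(r) > cutoff]
    return sorted(flagged, key=lambda item: -abs(item[1]))


# Hypothetical observed and expected correlations for three field pairs.
observed = {("telephone", "zip"): 0.40, ("ssn", "telephone"): 0.12, ("sex", "zip"): 0.03}
expected = {("telephone", "zip"): 0.25, ("ssn", "telephone"): 0.10, ("sex", "zip"): 0.04}
residuals = correlation_residuals(observed, expected)
print(flag_pairs(residuals))
```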

Step 5 

Incorporate the conditional dependence between the pair
of fields identified in Step 4 as an interaction term in the
loglinear model. Specifically, fit the following four models:
interaction in the match class only, interaction in the
nonmatch class only, interaction in both classes with different
coefficients, and interaction in both classes with the same
coefficient. Since the four models are not all nested, BIC is
used to compare them and the model with the lowest BIC is
chosen. Then repeat Steps 3 through 5, obtaining the expected
number of record pairs $m(Y_d, M_d)$ by substituting the
parameter estimates of the $\lambda$s from the chosen model into
Equation (3) with the appropriate interactions instead of
Equation (1), until no large correlation residuals are
apparent.
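In terms of Equation (3), the within-class interaction coefficient is $\lambda_{jl}$ for non-matches and $\lambda_{jl} + \lambda_{Mjl}$ for matches, so the four candidate models can be read as constraints on these two quantities. The sketch below encodes that reading and then picks the variant with the lowest BIC. The variant names and BIC values are hypothetical, and the constraint mapping is an interpretation of Equation (3) rather than something spelled out in the text.

```python
def class_interactions(variant, a=0.0, b=0.0):
    """Return (non-match, match) interaction coefficients for fields j and l.

    a and b are free parameters; which of them is used depends on the variant.
    """
    if variant == "match_only":
        return 0.0, a
    if variant == "nonmatch_only":
        return a, 0.0
    if variant == "both_different":
        return a, b
    if variant == "both_equal":
        return a, a
    raise ValueError(f"unknown variant: {variant}")


# Number of parameters each variant adds to the current model.
EXTRA_PARAMS = {
    "match_only": 1,
    "nonmatch_only": 1,
    "both_different": 2,
    "both_equal": 1,
}


def to_equation3(nonmatch_coef, match_coef):
    """Convert class-specific coefficients to (lambda_jl, lambda_Mjl)."""
    return nonmatch_coef, match_coef - nonmatch_coef


# Hypothetical BIC values for the four fitted variants.
bics = {
    "match_only": 3105.2,
    "nonmatch_only": 3240.8,
    "both_different": 3011.7,
    "both_equal": 3098.4,
}
best = min(bics, key=bics.get)
print(best, EXTRA_PARAMS[best])
```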

Step 6 

To classify individual pairs as match, non-match or un- 
certain matches, we use the final model parameter esti- 
mates to calculate the match score for each agreement 
pattern. Record pairs are then declared as matches or 
non-matches based on these match scores. 
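A simple way to turn match scores into the three categories is to compare each score against a lower and an upper threshold. The thresholds and scores below are hypothetical; the text does not state how thresholds were chosen, so this is only one possible rule.

```python
def classify(score, lower=0.0, upper=8.0):
    """Three-way decision from a match score; thresholds are assumptions."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "uncertain"


# Hypothetical match scores for three agreement patterns.
scores = {(1, 1, 1): 14.2, (1, 0, 1): 3.6, (0, 0, 0): -9.8}
decisions = {pattern: classify(score) for pattern, score in scores.items()}
print(decisions)
```

Because every record pair with the same agreement pattern shares the same score, the decision only needs to be computed once per pattern and can then be joined back to the pairs.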

Approval to perform this study was obtained from the 
Indiana University Institutional Review Board: approval 
number 1010002784 (0909-68). De-identified data for the 
HIE example described in the next section is provided as 
Additional file 3. 

Results 

Description of the HIE dataset 

We applied the above steps to de-duplicate the client 
registry for the Marion County Health Department 
(MCHD). De-duplication is a class of record linkage 
where a data set is linked to itself to identify potential 
duplicate records. MCHD is a member of the Indiana 
Network for Patient Care, the nation's largest and lon- 
gest tenured HIE [24]. 

The MCHD client registry contains 779,466 patient records
gathered from multiple public health service areas. These data
are therefore highly heterogeneous, and the method of input may
be any combination of standardized electronic entry, paper
entry, or manual entry for a given field. Since the total number
of all potential record pairs is extremely large (approximately
$3 \times 10^{11}$ potential pairs), the data were first blocked
to minimize the search space for potential matches. The MCHD
client registry was blocked on last name and first name; thus
only record pairs agreeing on these two fields are included in
the analysis. This reduced the number of potential record pairs
to 618,213. The remaining fields in this dataset are day, month,
and year of birth, social security number (SSN), telephone
number, zip code, and gender. The level of missingness across
the different fields varies from as low as 0% for day, month,
and year of birth to as high as 95% for specific identifiers
such as SSN. Missing values were coded as disagreements. We then
applied the six-step process described above to all pairs from
this block.
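The size of the unblocked search space follows from counting unordered pairs among 779,466 records, and blocking can be sketched as grouping records by the blocking key and pairing only within groups. The record layout and field names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Unordered pairs among 779,466 records: roughly 3 x 10^11.
print(comb(779_466, 2))


def blocked_pairs(records, keys=("last_name", "first_name")):
    """Yield index pairs of records that agree exactly on all blocking keys."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[tuple(rec[k] for k in keys)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical records for illustration.
records = [
    {"last_name": "SMITH", "first_name": "ANN"},
    {"last_name": "SMITH", "first_name": "ANN"},
    {"last_name": "SMITH", "first_name": "BOB"},
    {"last_name": "SMITH", "first_name": "ANN"},
]
print(list(blocked_pairs(records)))  # [(0, 1), (0, 3), (1, 3)]
```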

Application to HIE dataset 

As described in the previous section, we first fit the
conditional independence model (Step 1). With 7 fields, this
model contains 15 parameters (Model 0). Parameter estimates are
provided in Table 1. The overall match prevalence was estimated
to be 3.7% under this model. The m-probabilities ranged from a
low of 0.025 (SSN) to a high of 0.716 (month of birth),
indicating that only 2.5% of the matched record pairs agreed on
SSN while 71.6% agreed on month of birth. The estimated
u-probabilities were small as expected, except for sex, which
was estimated to be 0.661, indicating that 66.1% of non-matching
record pairs agreed on sex. Also, the u-probability for SSN was
nearly zero, indicating that very few non-matching record pairs
agree on SSN. The deviance of this model was $G^2 = 8852.9$.
Assuming independence, the observed and expected pairwise
correlations were calculated (Steps 2, 3) and the differences
were displayed in the correlation residual plot (Step 4), as
shown in Figure 1 (Panel A).

The seven fields in this particular dataset yield 21 pair- 
wise correlations. The difference between the 21 observed 
and expected pairwise correlations from this model ranged 
from -0.027 to 0.155. The majority of the correlation re- 
siduals from the conditional independence model fluctuate 
between -0.03 and 0.05. However, the correlation residual 
between the fields telephone number and zip code is much 
larger than the others (almost 5-fold difference), indicating 
a violation of the conditional independence assumption 
for this pair of variables. 

To accommodate the conditional dependence between
telephone number and zip code, we followed Step 5 and
compared the fit of four models, specifically, models with
interaction in the match class only, interaction in the
non-match class only, interaction in both classes with different
coefficients, and interaction in both classes with equal
coefficients. The model with interactions between telephone
number and zip code in both classes with different coefficients
yielded the lowest BIC and hence was selected (Model I).

Table 1 Loglinear model results for last name/first name block

                            Model 0 (conditional
                            independence)           Model I                 Model II
Field           Parameter   Estimate    Std Error   Estimate    Std Error   Estimate    Std Error
                π           0.037       0.0004      0.035       0.0004      0.035       0.0004
Year of birth   m_1         0.581       0.0047      0.615       0.0048      0.615       0.0048
SSN             m_2         0.025       0.0011      0.027       0.0011      0.026       0.0011
Day of birth    m_3         0.572       0.0046      0.608       0.0048      0.608       0.0048
Telephone       m_4         0.173       0.0029      0.141       0.0026      0.140       0.0026
Zip code        m_5         0.409       0.0041      0.363       0.0040      0.362       0.0040
Sex             m_6         0.710       0.0037      0.695       0.0038      0.694       0.0038
Month of birth  m_7         0.716       0.0044      0.768       0.0044      0.769       0.0045
Year of birth   u_1         0.026       0.0002      0.026       0.0002      0.026       0.0002
SSN             u_2         6E-06       8E-06       1E-05       9E-06       5E-05       1E-05
Day of birth    u_3         0.032       0.0003      0.031       0.0003      0.031       0.0003
Telephone       u_4         5E-04       0.0001      0.002       0.0001      0.002       0.0001
Zip code        u_5         0.037       0.0003      0.039       0.0003      0.039       0.0003
Sex             u_6         0.661       0.0006      0.661       0.0006      0.661       0.0006
Month of birth  u_7         0.082       0.0004      0.081       0.0004      0.081       0.0004
G^2                         8852.9                  2974.26                 2881.45

Loglinear model results for MCHD data blocked on last name and first name (Number of record pairs = 618,213). All parameters are statistically significant
(p < .001) for all three models, except for u_2, which is not significant for the conditional independence model (p = .468) or Loglinear Model I (p = .143).

The parameter estimates of the match prevalence and
m- and u-probabilities for Model I are also shown in
Table 1. Model I provided a better fit to the data compared
to the conditional independence model, which is indicated
by its much lower deviance ($G^2 = 2974.3$). The estimated
match prevalence was 3.5%, slightly lower than the estimate
under the conditional independence model. The estimated
u-probabilities were quite similar under both models. The
difference was seen in the estimated m-probabilities, with
an increase in the estimates corresponding to year, month,
day of birth and SSN and a decrease in estimates
corresponding to telephone, zip code and sex compared to the
conditional independence model.

After calculating the expected correlations under Model
I (repeat Step 3), Figure 1 (Panel B) shows the correlation
residuals for pairs of fields (repeat Step 4). The large
correlation residual for telephone number and zip code is no
longer present since this pairwise dependence has been
accounted for by the model. The magnitudes of all correlation
residuals were under 0.05. Our experience suggests that an
absolute value greater than 0.05 for the correlation residual
is a reasonable approximate guideline for identifying
conditional pairwise field dependence.

Although the correlation residual plot did not reveal
substantial conditional dependence between other pairs of
fields, the correlation residual between SSN and telephone
number (0.047) was more than twice the magnitude of the
remaining residuals. To examine whether it is appropriate to
consider conditional dependence between this pair of fields, we
repeated Step 5 and compared the fit of four additional models.
These four models again include models with interaction between
SSN and telephone number in the match class only, interaction in
the non-match class only, interaction in both classes with
different coefficients, and interaction in both classes with
equal coefficients. Based on the BIC, the model with
interactions in both classes with different coefficients was
selected (Model II).

Model II provided a better fit to the data compared to
Model I with a smaller deviance and BIC, as well as smaller
correlation residuals (Figure 1 Panel C). However, the
improvement in model fit was marginal. This is indicated
by its deviance $G^2 = 2881.45$, which was only slightly less
than the deviance of Model I. As a result, parameter estimates
under Model II did not differ much from the less complex
Model I. Because this is consistent with our general guideline
of adding interactions only for correlation residuals above
0.05, we chose Model I as our final model.

Patient records were then classified as match or non-match
based on the estimated match prevalence from these three
models. The conditional independence model classified 1,152
record pairs as matches. Model I classified 1,082 matches and
Model II results were almost identical to Model I with 1,081
matches. Thus not accounting for the conditional dependence
yielded the largest number of declared duplicates, likely
producing falsely merged records resulting in lost patient
records. Since it was not the purpose of our study to assess the
accuracy of the different models, we did not manually ascertain
the true match status of the records. We refer the readers to
the literature [2,7] for examples where accounting for the
conditional dependence may improve the performance of the
matching algorithm.

Figure 1 Correlation residual plots for last name/first name block. Last name/first name block: pairwise correlation residuals for Model 0
(conditional independence; Panel A), Model I (Panel B), and Model II (Panel C).

Discussion 

For many record linkage applications, the assumption of
conditional independence for field agreement is often violated,
and ignoring the conditional dependence may lead to suboptimal
record matching accuracy. To optimize matching accuracy, it is
important to examine whether conditional dependence exists and
to incorporate such dependence in the model in a proper way.

In this paper, we presented a step-by-step procedure to 
identify and incorporate conditional dependence among 
fields using a loglinear latent class model. This stepwise 
method can be implemented using standard statistical soft- 
ware. In contrast to previous studies where loglinear latent 
class models were estimated using the iterative EM algo- 
rithm, we proposed estimating parameters using the readily 
available SAS procedure NLMIXED. Our step-by-step 
process was applied to the de-duplication of the MCHD cli- 
ent registry. The results indicated that conditional depend- 
ence can be readily identified using a graphical approach 
and the model with appropriate conditional dependence 
provided a much better model fit than the conditional inde- 
pendence model. 
Although a stepwise variable-selection strategy that accounted
for the conditional dependence using interaction terms in
loglinear models was previously proposed by Zhu et al. [19],
the model building process proposed in this paper is different
in several important ways. First, the quasi-Newton approach
implemented in the SAS NLMIXED procedure used in this paper is
more efficient than the EM algorithm. Second, the final model
obtained by the
Zhu et al. approach must include all interactions of 
the same order, while our approach can identify specific 
two-way interactions. When a large number of fields are 
involved in record linkage, examining all interactions of 
the same order will introduce a large number of additional 
parameters. Model estimation can become quite complex 
with many local maxima. In addition, important condi- 
tional dependencies may not be detected using Zhu et al.'s 
approach when only a few interaction effects exist. 

In addition to loglinear models, latent class models with 
conditional dependence have been extensively studied and 
widely used in other domains. For example, in diagnostic 
testing, latent class models with random effects [23,25], 
probit latent class models [26], and finite mixture models 
[27] have all been used to evaluate the accuracy of diag- 
nostic tests when no gold standard is available to evaluate 
the true disease status. Further investigation is needed to 
examine how the other conditional dependence models 
compare to loglinear models and to determine the impact 
of incorporating conditional dependence in record linkage. 

A potential limitation of our approach is that it is more
labor intensive because it requires understanding how to fit
loglinear models. Additionally, the loglinear model framework
requires a parameterization that is not as readily understood
by practitioners. The approach is iterative and thus is also
more computationally intensive. However, these challenges
are mitigated by providing example code as additional files.

Conclusions 

We have proposed a novel and practical approach to iden- 
tify and incorporate conditional dependence in record link- 
age. Compared to the commonly used F-S model, the 
conditional dependence model provides substantially better 
fit to the data when conditional dependence exists. Given 
that some fields commonly used for linking health records 
often have correlated agreement patterns, we recommend 
the routine use of our proposed methods to avoid model 
misfit. Our approach can be easily followed using the step- 
by-step instructions and the sample code provided. 

Additional files 



Additional file 1: MCHD Loglinear Models.sas. SAS program that uses
the loglinear approach described in this paper to fit the MCHD dataset.
Requires SAS software available from SAS Institute Inc., Cary, NC, USA.

Additional file 2: corr macro.sas. SAS macro to compute the
correlation residual. Requires SAS software available from SAS Institute
Inc., Cary, NC, USA.

Additional file 3: MCHD data.csv. MCHD data supplied in standard
csv format.